{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Automated Data Quality Checks with the Data Profiler and Great Expectations\n",
    "\n",
    "## Overview\n",
    "An engineer can leverage the DataProfiler tool in order to get columnar and tabular metadata metrics that describe their data. An engineer can use the DataProfiler `diff()` functionality to compare two profiles from one another. However, this example is going to run through a scenario where an engineer wants to have automated data quality checks set up for future data that will be aggregated to the original dataset.\n",
    "\n",
    "In this example, an engineer is going to generate a DataProfiler report on an initial batch of data and then use the report to set up automated data quality checks. Then the engineer will run those checks against a future batch of data in order to identify any data quality concerns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download the Data\n",
    "Download the csv that contains the data we will be using for this example which can be found [here](https://github.com/great-expectations/gx_tutorials/blob/main/data/yellow_tripdata_sample_2019-01.csv) then save it to the `/examples/great_expectations/data` directory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports\n",
    "First we need to import great expectations and the data profiler."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "import great_expectations as ge\n",
    "from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource, check_if_datasource_name_exists\n",
    "from great_expectations.core.batch import BatchRequest\n",
    "from great_expectations.checkpoint import SimpleCheckpoint\n",
    "from great_expectations.rule_based_profiler.data_assistant_result import (\n",
    "    DataAssistantResult,\n",
    ")\n",
    "import dataprofiler as dp\n",
    "\n",
    "from capitalone_dataprofiler_expectations.rule_based_profiler.data_assistant.data_profiler_structured_data_assistant import (\n",
    "    DataProfilerStructuredDataAssistant,\n",
    ")\n",
    "from capitalone_dataprofiler_expectations.rule_based_profiler.data_assistant_result.data_profiler_structured_data_assistant_result import (\n",
    "    DataProfilerStructuredDataAssistantResult,\n",
    ")\n",
    "import capitalone_dataprofiler_expectations.metrics.data_profiler_metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Great Expecations Set Up\n",
    "Next you will need to create and load the DataContext, build their batch request, retrieve the validator from the context, and create a checkpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare Batch Request\n",
    "context = ge.get_context()\n",
    "datasource_config = f\"\"\"\n",
    "    name: taxi_multi_batch_datasource\n",
    "    class_name: Datasource\n",
    "    module_name: great_expectations.datasource\n",
    "    execution_engine:\n",
    "        module_name: great_expectations.execution_engine\n",
    "        class_name: PandasExecutionEngine\n",
    "    data_connectors:\n",
    "        inferred_data_connector_all_years:\n",
    "            class_name: InferredAssetFilesystemDataConnector\n",
    "            base_directory: ../data\n",
    "            default_regex:\n",
    "                group_names:\n",
    "                    - data_asset_name\n",
    "                    - year\n",
    "                    - month\n",
    "                pattern: (yellow_tripdata_sample)_(\\\\d.*)-(\\\\d.*)\\\\.csv\n",
    "\"\"\"\n",
    "context.test_yaml_config(yaml_config=datasource_config)\n",
    "sanitize_yaml_and_save_datasource(context, datasource_config, overwrite_existing=False)\n",
    "batch_request = {\n",
    "    \"datasource_name\": \"taxi_multi_batch_datasource\",\n",
    "    \"data_connector_name\": \"inferred_data_connector_all_years\",\n",
    "    \"data_asset_name\": \"yellow_tripdata_sample\",\n",
    "    \"data_connector_query\": {\"index\": 0},\n",
    "}\n",
    "\n",
    "# Prepare a new expectation suite\n",
    "context = ge.data_context.DataContext()\n",
    "expectation_suite_name = \"yellow_tripdata_sample\"\n",
    "expectation_suite = context.create_expectation_suite(\n",
    "    expectation_suite_name=expectation_suite_name, overwrite_existing=True\n",
    ")\n",
    "\n",
    "validator = context.get_validator(\n",
    "    batch_request=BatchRequest(**batch_request),\n",
    "    expectation_suite_name=expectation_suite_name,\n",
    ")\n",
    "\n",
    "checkpoint_config = {\n",
    "    \"class_name\": \"SimpleCheckpoint\",\n",
    "    \"validations\": [\n",
    "        {\n",
    "            \"batch_request\": batch_request,\n",
    "            \"expectation_suite_name\": expectation_suite_name,\n",
    "        }\n",
    "    ],\n",
    "}\n",
    "\n",
    "checkpoint = SimpleCheckpoint(\n",
    "    f\"{validator.active_batch_definition.data_asset_name}_{expectation_suite_name}\",\n",
    "    context,\n",
    "    **checkpoint_config,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the data assistant and save the expectation suite\n",
    "Now we will pass the profile path into the data profiler data assistant to generate the expectation suite."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profile_path: str = os.path.join(os.getcwd(), \"data/yellow_tripdata_sample_2019-01.csv\")\n",
    "saved_profile_path: str = os.path.join(os.getcwd(), \"data/yellow_tripdata_sample_2019-01.pkl\")\n",
    "data = dp.Data(profile_path)\n",
    "profiler_options = dp.ProfilerOptions()\n",
    "profiler_options.set({\"data_labeler.is_enabled\": False})\n",
    "profiler = dp.Profiler(data)\n",
    "profiler.save(filepath=\"data/yellow_tripdata_sample_2019-01.pkl\")\n",
    "report = profiler.report()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exclude_column_names = [\n",
    "    # \"vendor_id\",\n",
    "    \"pickup_datetime\",\n",
    "    \"dropoff_datetime\",\n",
    "    # \"passenger_count\",\n",
    "    # \"trip_distance\",\n",
    "    # \"rate_code_id\",\n",
    "    \"store_and_fwd_flag\",\n",
    "    # \"pickup_location_id\",\n",
    "    # \"dropoff_location_id\",\n",
    "    # \"payment_type\",\n",
    "    # \"fare_amount\",\n",
    "    # \"extra\",\n",
    "    # \"mta_tax\",\n",
    "    # \"tip_amount\",\n",
    "    # \"tolls_amount\",\n",
    "    # \"improvement_surcharge\",\n",
    "    # \"total_amount\",\n",
    "    \"congestion_surcharge\",\n",
    "]\n",
    "\n",
    "\n",
    "result: DataAssistantResult = context.assistants.data_profiler.run(\n",
    "    batch_request=batch_request,\n",
    "    exclude_column_names=exclude_column_names,\n",
    "    numeric_rule={\n",
    "        \"profile_path\": saved_profile_path,\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run validation against the expectation suite\n",
    " Then the expectation suite will be saved to the validator. Then the checkpoint will be run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "validator.expectation_suite = result.get_expectation_suite(\n",
    "    expectation_suite_name=expectation_suite_name\n",
    ")\n",
    "\n",
    "validator.save_expectation_suite(discard_failed_expectations=False)\n",
    "\n",
    "checkpoint_result = checkpoint.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build and display the data docs\n",
    "The checkpoint results from above will be used to populate the data docs which will display the expectation suite validation results again the new data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "context.build_data_docs()\n",
    "\n",
    "validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]\n",
    "context.open_data_docs(resource_identifier=validation_result_identifier)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "venv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
